
perf(geotiff): batch _nvcomp_batch_compress allocations and D2H readback (#1712)#1729

Merged

brendancol merged 2 commits into main from deep-sweep-performance-geotiff-2026-05-12-aa0e4ef9-03 on May 12, 2026

Conversation

@brendancol
Contributor

Summary

_nvcomp_batch_compress had two anti-patterns left over after the
decode-side batched D2H fix in #1552 and the device-pointer fix in
#1659:

  1. Per-tile cupy.empty allocations for compressed-output buffers.
    For an N-tile write this issued N memory-pool calls. Replace with one
    contiguous allocation sized n_tiles * max_cs plus per-tile slab
    views; the per-tile pointer array still lets nvCOMP write
    independent slabs in parallel. _check_gpu_memory now bounds the
    new allocation.

  2. Per-tile .get().tobytes() in the result-collection loop. Each
    .get() was a separate D2H transfer on the default stream, and the
    per-DMA setup cost dominated wall time for large-N writes (exactly
    the pattern that #1552, "Batch GDS-fallback per-tile GPU to host
    transfer", fixed on the decode side). Replace with one
    cupy.concatenate of trimmed slabs and one .get(), then slice
    the host buffer by cumulative offsets. The adler32 deflate-wrap step
    stays on the host as before. (A minimal sketch of both changes
    follows this list.)
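
For orientation only, here is a minimal sketch of the two changes. The function and parameter names (batched_compress_readback, comp_sizes_host) are hypothetical, and the nvCOMP call itself is elided; this is not the literal body of _nvcomp_batch_compress.

import cupy
import numpy as np

def batched_compress_readback(n_tiles, max_cs, comp_sizes_host):
    # (1) One contiguous device pool instead of n_tiles cupy.empty calls.
    # The PR bounds n_tiles * max_cs via _check_gpu_memory before this point.
    d_pool = cupy.empty(n_tiles * max_cs, dtype=cupy.uint8)
    d_comp_bufs = [d_pool[i * max_cs:(i + 1) * max_cs] for i in range(n_tiles)]
    # nvCOMP still receives one device pointer per tile, so it can write
    # the independent slabs in parallel.
    d_ptrs = cupy.asarray([b.data.ptr for b in d_comp_bufs], dtype=cupy.uint64)

    # ... nvCOMP batch compress writes into the slabs and reports the
    # per-tile compressed sizes collected in comp_sizes_host ...

    # (2) One packed D2H copy instead of n_tiles .get() calls: trim each
    # slab to its compressed size, concatenate on-device, copy once, then
    # slice the host bytes by cumulative offsets.
    trimmed = [d_comp_bufs[i][:int(comp_sizes_host[i])] for i in range(n_tiles)]
    host = bytes(cupy.concatenate(trimmed).get())
    offsets = np.concatenate([[0], np.cumsum(comp_sizes_host)]).astype(int)
    return [host[offsets[i]:offsets[i + 1]] for i in range(n_tiles)]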

Real-world benchmark on an Ampere-class GPU: 2048x2048 float32 zstd
GPU write at tile_size=64 (1024 tiles) drops median from 84.3 ms to
54.7 ms (~35% reduction). The decode-side fix in #1552 measured a
similar 256 ms -> 38 ms drop on the analogous LZW path; this PR brings
the encode side in line with it.

Closes #1712.

Test plan

  • test_no_per_tile_cupy_empty_in_compressed_pool and test_no_per_tile_get_in_result_loop: source-level guards that fail loudly if either anti-pattern returns (a rough illustration of this style of guard follows the list).
  • test_gpu_write_roundtrip_after_batched_compress[deflate] and [zstd]: end-to-end GPU write + read produces the same float32 array, catching any off-by-one in the host-side cumulative offsets.
  • test_gpu_write_zero_tile_edge_case: tiny 32x32 input still round-trips through the fast path.
  • Existing 53 GPU writer tests in test_gpu_writer_compression_modes_2026_05_11.py, test_gpu_writer_attrs_1563.py, test_gpu_writer_band_first_1580.py, test_gpu_writer_nan_sentinel_1599.py, test_to_geotiff_gpu_fallback_1674.py, test_nvcomp_from_device_bufs_single_alloc_1659.py all pass.
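
For illustration, one of the source-level guards could look roughly like the following; the actual assertions in test_nvcomp_batch_compress_batched_1712.py may be stricter, and the exact string checked is an assumption.

import inspect

from xrspatial.geotiff import _gpu_decode


def test_no_per_tile_get_in_result_loop():
    # Each per-tile .get().tobytes() call was a separate D2H transfer; the
    # batched path replaces them with one concatenate plus a single .get(),
    # so the old pattern reappearing in the source is a regression marker.
    src = inspect.getsource(_gpu_decode._nvcomp_batch_compress)
    assert ".get().tobytes()" not in src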

@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 12, 2026
@brendancol brendancol requested a review from Copilot May 12, 2026 19:14
Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes the GeoTIFF GPU write path’s nvCOMP batch compression by removing per-tile device allocations and per-tile device→host readbacks, bringing the encode side in line with the already-batched decode patterns.

Changes:

  • Replace per-tile cupy.empty compressed-output allocations with a single contiguous device pool + per-tile slab views/pointers.
  • Replace per-tile .get().tobytes() readback with a packed cupy.concatenate(...) + single .get() and host slicing.
  • Add a new test module intended to prevent regressions back to the per-tile allocation/readback patterns and to round-trip deflate/zstd writes.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
xrspatial/geotiff/_gpu_decode.py | Implements the contiguous compressed-output pool and batched D2H readback in _nvcomp_batch_compress.
xrspatial/geotiff/tests/test_nvcomp_batch_compress_batched_1712.py | Adds structural/source assertions and end-to-end round-trip tests for the updated batching behavior.

Comments suppressed due to low confidence (2)

xrspatial/geotiff/tests/test_nvcomp_batch_compress_batched_1712.py:106

  • These round-trip tests don’t currently guarantee that the nvCOMP GPU compress path actually ran: write_geotiff_gpu falls back to CPU codecs when _nvcomp_batch_compress returns None and does not necessarily raise RuntimeError, so the test can pass while never exercising the changed code. To ensure coverage, skip up-front unless nvCOMP is available/usable (e.g., _gpu_decode._get_nvcomp() is not None, and/or assert the writer took the nvCOMP branch via a spy/monkeypatch/strict-mode signal).
    with tempfile.TemporaryDirectory(prefix="nvcomp_batch_1712_") as td:
        path = os.path.join(td, f"roundtrip_{compression}.tif")
        try:
            write_geotiff_gpu(
                darr, path,
                compression=compression,
                tiled=True,
                tile_size=64,
            )
        except RuntimeError as e:
            # nvCOMP may be unavailable in this environment; the writer
            # falls back to CPU and that path doesn't exercise the
            # change. Skip rather than fail.
            pytest.skip(f"nvCOMP unavailable for {compression}: {e}")

xrspatial/geotiff/tests/test_nvcomp_batch_compress_batched_1712.py:124

  • The docstring says this is a “0-tile compress” edge case, but the test exercises a 32x32 single-tile write/read instead and even notes that n_tiles==0 never occurs via the writer. Either adjust the docstring to match what’s being tested (tiny single-tile round-trip) or add a direct unit-level test that actually hits the n_tiles == 0 branch in _nvcomp_batch_compress (e.g., via a stub nvCOMP handle).
def test_gpu_write_zero_tile_edge_case():
    """A 0-tile compress returns an empty list without indexing into None.

    The cumulative-sum / concat path must short-circuit before
    ``cupy.concatenate`` (which would raise on an empty list). The
    pre-fix loop simply iterated zero times, so the contract is the
    same empty-list output.
    """
    # Direct call into the internal function with n_tiles=0 is
    # awkward because it needs a libnvCOMP handle and matching opts.
    # Instead, exercise the public writer with a tiny single-tile
    # input and confirm the fast path does not crash. Real n_tiles==0
    # never occurs via the writer (every image has at least one tile).
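
The short-circuit the docstring refers to would amount to a guard like the following inside _nvcomp_batch_compress; its exact placement and the surrounding names are assumptions, not the repository's code.

    if n_tiles == 0:
        # cupy.concatenate raises on an empty sequence, so return the same
        # empty list the pre-fix per-tile loop produced by iterating zero times.
        return []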


Comment on lines +2555 to +2559
trimmed = [
    d_comp_bufs[i][:int(comp_sizes[i])] for i in range(n_tiles)
]
d_comp_concat = cupy.concatenate(trimmed)
host_concat = bytes(d_comp_concat.get())
Comment on lines +24 to +34
try:
    import cupy
    _HAS_CUPY = True
except Exception:
    _HAS_CUPY = False

# nvCOMP is the entry point that exercises this code path.
from xrspatial.geotiff import _gpu_decode

needs_cupy = pytest.mark.skipif(
    not _HAS_CUPY, reason="CUDA / cupy unavailable on this host"
)

perf(geotiff): batch _nvcomp_batch_compress allocations and D2H readback (#1712)

Two related anti-patterns in the GPU compress path remained after the
decode-side fix in #1552 and the device-pointer fix in #1659:

1. Per-tile cupy.empty allocations for the compressed-output buffers.
   For an N-tile write, this issued N memory-pool calls. Replace with
   one contiguous allocation of size n_tiles * max_cs plus per-tile
   slab views; the per-tile pointer array still lets nvCOMP write
   independent slabs in parallel.

2. Per-tile .get().tobytes() in the result-collection loop. Each
   .get() was a separate D2H transfer on the default stream, and the
   per-DMA setup cost dominated wall time for large-N writes (the
   exact pattern #1552 fixed on the decode side). Replace with a
   single cupy.concatenate of trimmed slabs and one .get(), then
   slice the host buffer by cumulative offsets to peel out per-tile
   payloads. The adler32 deflate-wrap step is unchanged.

Real-world benchmark on an Ampere-class GPU: 2048x2048 float32 zstd
GPU write at tile_size=64 (1024 tiles) drops median from 84.3ms to
54.7ms (~35% reduction).

Tests cover the structural change (regressions back to the per-tile
patterns fail loudly) plus end-to-end round-trip equality at
deflate and zstd compression. _check_gpu_memory now bounds the new
contiguous allocation in the same way the decode-side fix does.
@brendancol brendancol force-pushed the deep-sweep-performance-geotiff-2026-05-12-aa0e4ef9-03 branch from 958e8f4 to 14cf257 Compare May 12, 2026 19:29
- Add _check_gpu_memory guard before nvcomp staging-buffer concatenate
  (mirrors _batched_d2h_to_bytes pattern)
- Tighten GPU skip to require cupy.cuda.is_available()
@brendancol brendancol merged commit 635246c into main May 12, 2026
10 of 11 checks passed

Labels

performance PR touches performance-sensitive code


Development

Successfully merging this pull request may close these issues.

perf(geotiff): batch _nvcomp_batch_compress per-tile D2H and allocations

2 participants